The impact of COVID-19 on the minimum nights of stay for Airbnb listings in Amsterdam

Team 1

Introduction

For the course Data Preparation and Workflow Management at Tilburg University, we analyzed the Airbnb market in Amsterdam, and in particular whether the COVID-19 pandemic influenced the required minimum nights of stay. The literature on this subject is contradictory. A recent article in the New York Times suggested that the minimum nights of stay increased in New York City during the COVID-19 pandemic, whereas research by Kourtit et al. concluded that minimum-night requirements actually decreased during the pandemic. To examine this contradiction, we collected Airbnb data for Amsterdam from 2020 and from 2022 and tested whether the minimum nights of stay differ significantly between the pandemic and the period after it.

Motivation

Merged with the Introduction.

Research Method

Research Method (1)

We ran a linear regression on the variables of interest. The dependent variable, the required minimum nights, is metric; the independent variable, the presence of COVID-19 (present vs. absent), is non-metric. We have data from 2020 and 2022 for 3960 different Airbnb listings (7920 observations in total). The COVID-19 variable takes the value 1 if the observation is from 2020, when COVID-19 was present in the Netherlands, and 0 if it is from 2022, when the pandemic no longer had far-reaching consequences in the Netherlands. Besides the minimum nights of stay and the presence of COVID-19, we added control variables to see whether other effects play a role. Because these control variables are a mix of metric and non-metric variables, we chose linear regression over ANOVA.
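The coding of the COVID-19 variable can be illustrated with a minimal sketch. The data frame toy, its columns and all of its values are made up for illustration and are not taken from the project data. The sketch shows that, in a regression with only a 0/1 (here TRUE/FALSE) regressor, the coefficient equals the difference in mean minimum nights between the two years.

```r
# Hypothetical toy data: two snapshots of the same five listings (made-up values)
toy <- data.frame(
  listing_id     = rep(1:5, times = 2),
  year           = rep(c(2020, 2022), each = 5),
  minimum_nights = c(2, 3, 1, 4, 2, 3, 3, 2, 5, 2)
)

# covid is TRUE for the 2020 snapshot and FALSE for the 2022 snapshot
toy$covid <- toy$year == 2020

# With a single dummy regressor, the slope is the difference in group means
m_toy <- lm(minimum_nights ~ covid, data = toy)
mean_diff <- mean(toy$minimum_nights[toy$covid]) -
  mean(toy$minimum_nights[!toy$covid])

coef(m_toy)["covidTRUE"]  # same value as mean_diff
```

Once control variables are added, the coefficient is the difference that remains after holding those controls constant, so it no longer has to match the raw difference in means.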

Research method (2)

Besides the dependent variable minimum_nights and the independent variable covid, we included several control variables in a first regression. The control variables neighbourhood_num and roomtype_num were converted to factors, in which each number represents a different neighbourhood or room type. In addition, accommodates, price and instant_bookable were included in this regression.

# Estimate simple model
m1 <- lm(minimum_nights ~ covid + as.factor(neighbourhood_num) +
           as.factor(roomtype_num) + accommodates + price + instant_bookable,
         data = df_cleaned)
summary(m1)

Call:
lm(formula = minimum_nights ~ covid + as.factor(neighbourhood_num) + 
    as.factor(roomtype_num) + accommodates + price + instant_bookable, 
    data = df_cleaned)

Residuals:
   Min     1Q Median     3Q    Max 
-4.125 -1.208 -0.465  0.420 56.063 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     1.8135725  0.4647861   3.902 9.62e-05 ***
covidTRUE                      -0.3726339  0.0650950  -5.724 1.08e-08 ***
as.factor(neighbourhood_num)2  -0.1691364  0.4187420  -0.404 0.686286    
as.factor(neighbourhood_num)3   0.5684251  0.4020637   1.414 0.157469    
as.factor(neighbourhood_num)4   1.8470679  0.5182352   3.564 0.000367 ***
as.factor(neighbourhood_num)5   0.1645501  0.4031250   0.408 0.683148    
as.factor(neighbourhood_num)6   0.2245008  0.4908163   0.457 0.647394    
as.factor(neighbourhood_num)7   0.0232328  0.4525262   0.051 0.959056    
as.factor(neighbourhood_num)8   0.3722813  0.4200383   0.886 0.375481    
as.factor(neighbourhood_num)9   0.2608471  0.4156185   0.628 0.530276    
as.factor(neighbourhood_num)10  0.0958234  0.4532241   0.211 0.832560    
as.factor(neighbourhood_num)11  0.0529102  0.4089200   0.129 0.897052    
as.factor(neighbourhood_num)12  0.3786304  0.8039026   0.471 0.637661    
as.factor(neighbourhood_num)13  0.2277393  0.5198490   0.438 0.661335    
as.factor(neighbourhood_num)14  0.1334606  0.3999017   0.334 0.738589    
as.factor(neighbourhood_num)15  0.1186491  0.3992311   0.297 0.766326    
as.factor(neighbourhood_num)16  1.4707543  0.5527789   2.661 0.007815 ** 
as.factor(neighbourhood_num)17  0.4515336  0.4314885   1.046 0.295383    
as.factor(neighbourhood_num)18  0.0214983  0.4353634   0.049 0.960618    
as.factor(neighbourhood_num)19 -0.7940233  0.5832449  -1.361 0.173430    
as.factor(neighbourhood_num)20  0.6924107  0.4121391   1.680 0.092989 .  
as.factor(neighbourhood_num)21  0.4857487  0.4348607   1.117 0.264019    
as.factor(neighbourhood_num)22  0.0988388  0.4109316   0.241 0.809931    
as.factor(roomtype_num)2        1.9593214  0.2433889   8.050 9.48e-16 ***
as.factor(roomtype_num)3        0.6605957  0.2421359   2.728 0.006382 ** 
as.factor(roomtype_num)4       -0.0888187  0.4806452  -0.185 0.853398    
accommodates                   -0.0129177  0.0260161  -0.497 0.619535    
price                          -0.0011770  0.0003356  -3.507 0.000456 ***
instant_bookableTRUE           -0.4890019  0.0768799  -6.361 2.12e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.81 on 7891 degrees of freedom
Multiple R-squared:  0.07123,   Adjusted R-squared:  0.06793 
F-statistic: 21.61 on 28 and 7891 DF,  p-value: < 2.2e-16
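The output lists coefficients for neighbourhoods 2 to 22 and room types 2 to 4 only, because as.factor() turns each code into a set of dummy variables and the first level serves as the reference category. A minimal sketch with three made-up neighbourhood codes (not the project data) shows this expansion:

```r
# Hypothetical neighbourhood codes for six listings
neighbourhood_num <- c(1, 2, 3, 1, 2, 3)

# model.matrix() shows the columns lm() would actually estimate
X <- model.matrix(~ as.factor(neighbourhood_num))
colnames(X)
# Level 1 has no column of its own: it is absorbed into the intercept,
# and the other coefficients are differences relative to level 1
```

Each neighbourhood coefficient in m1 is therefore a difference relative to neighbourhood 1, not an absolute level.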

Results

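The output of the first regression shows that the covid coefficient is negative and significant (estimate -0.37, p < 0.001): holding the control variables constant, the required minimum stay was on average about 0.37 nights shorter in 2020 than in 2022. Room types 2 and 3, price and instant_bookable are also significant at the 1% level, whereas accommodates and most neighbourhood dummies (all except neighbourhoods 4 and 16) are not significant at the 5% level. The model explains only about 7% of the variance in minimum nights (R-squared = 0.071).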
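As a rough illustration of the size of the covid effect, a 95% confidence interval can be computed from the estimate, standard error and residual degrees of freedom reported in the output. This is only a sketch: it takes the ordinary least squares standard error at face value, which is an assumption the robustness checks may call into question.

```r
# Values copied from the summary(m1) output for covidTRUE
est <- -0.3726339
se  <- 0.0650950
df  <- 7891

# 95% confidence interval based on the t distribution
ci <- est + c(-1, 1) * qt(0.975, df) * se
ci  # roughly -0.50 to -0.25
```

With m1 available in the session, confint(m1, "covidTRUE") should give the same interval.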



Robustness Checks

Before any conclusions can be drawn, we need to perform some robustness checks.

Independence (1)

# create a scatterplot of the residuals against the predicted values from the linear regression model 'm1'
plot(m1$fitted.values, m1$residuals, 
     xlab = "Fitted Values", ylab = "Residuals", 
     main = "Residuals vs. Fitted Values Plot",
     ylim = c(-50, 60))
abline(h = 0, lty = 2, col = 'red')

A first way to check for independence is to plot the residuals against the fitted values of the linear regression model (here m1). The residuals should scatter randomly around zero without any pattern related to the fitted values, but the scatterplot shows that this is not the case. We conclude that the residuals are not independent.

Independence (2)

library(ggplot2)

# Store the fitted values of m1 so they can be plotted against the actual values
df_cleaned$predicted <- predict(m1)

# Create a scatterplot of predicted vs actual values
ggplot(df_cleaned, aes(x = predicted, y = minimum_nights)) +
  geom_point() + # adds points to the plot
  geom_abline(intercept = 0, slope = 1, color = "red") + # adds a diagonal line to the plot to visualize where predicted = actual
  xlab("Predicted Values") + # adds a label for the x-axis
  ylab("Actual Values") + # adds a label for the y-axis
  ggtitle("Predicted vs Actual Values Plot") # adds a title to the plot

In addition, we can plot the predicted values against the actual values. Points on the red diagonal line are observations for which the predicted value equals the actual value.

Independence (3)

# perform Durbin-Watson test (dwtest() comes from the lmtest package)
library(lmtest)
dwtest(m1)

    Durbin-Watson test

data:  m1
DW = 0.1109, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0

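A last check for independence is the Durbin-Watson test. The statistic (DW = 0.1109) lies far below 2, the value expected without autocorrelation, and the p-value is below 2.2e-16, so we reject the null hypothesis of no autocorrelation: the residuals are positively autocorrelated.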
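To make the Durbin-Watson statistic more concrete, the sketch below computes it by hand on simulated data. The helper dw_stat and the simulated variables are illustrative assumptions, not part of the project code. The statistic is the sum of squared differences between successive residuals divided by the residual sum of squares, and it is approximately 2 * (1 - rho), where rho is the first-order autocorrelation of the residuals.

```r
# Durbin-Watson statistic computed by hand (hypothetical helper)
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

# Made-up data with strongly autocorrelated errors (AR(1) with rho = 0.9)
set.seed(1)
n <- 500
x <- rnorm(n)
e <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))
y <- 1 + 2 * x + e
fit <- lm(y ~ x)

dw_sim <- dw_stat(resid(fit))  # well below 2 because rho is close to 1

# Rough first-order autocorrelation implied by the DW value reported for m1,
# using the approximation DW ~ 2 * (1 - rho)
rho_m1 <- 1 - 0.1109 / 2
rho_m1  # about 0.94
```

Because the statistic compares each residual with the one in the previous row, its value depends on how the rows of the data are ordered. If the rows of df_cleaned are sorted by listing or by year, which is not stated here, a low DW value may partly reflect that ordering.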

Homoskedasticity (1)

The same residual plots used to check independence can also be used to check homoskedasticity: the spread of the residuals should be roughly constant across the fitted values.

Homoskedasticity (2)

# perform Breusch-Pagan test
bptest(m1)

    studentized Breusch-Pagan test

data:  m1
BP = 93.6, df = 28, p-value = 5.38e-09

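A last check for homoskedasticity is the studentized Breusch-Pagan test. With BP = 93.6 on 28 degrees of freedom and a p-value of 5.38e-09, we reject the null hypothesis of constant error variance: the residuals are heteroskedastic.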
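The idea behind the studentized Breusch-Pagan test can be sketched by hand on simulated data (all variables below are made up for illustration). The squared residuals are regressed on the same regressors as the original model, and the test statistic is the number of observations times the R-squared of that auxiliary regression, compared with a chi-squared distribution whose degrees of freedom equal the number of regressors.

```r
# Made-up data in which the error spread grows with x (heteroskedastic)
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the regressors
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n times the auxiliary R-squared
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
bp_p  # very small: constant variance is rejected for these data
```

For m1 the auxiliary regression has 28 regressors, which matches the df = 28 in the bptest() output.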

Normality (1)

## Making a dataframe with the residuals
residuals <- resid(m1)
residuals_df <- data.frame(residuals = residuals)

# Test for normality of residuals with a histogram
ggplot(residuals_df, aes(x = residuals)) + 
  geom_histogram(binwidth = 0.5, color = "black", fill = "white") + 
  xlab("Residuals") + ylab("Frequency") +
  ggtitle("Histogram of Residuals")

TYPE HERE: explain why the histogram shows that the residuals are not normally distributed

Normality (2)

# Test for normality of residuals with a density plot (ggdensity() comes from the ggpubr package)
library(ggpubr)
ggdensity(residuals_df$residuals, 
          main = "Density plot of residuals",
          xlab = "residuals")

TYPE HERE: explain why the density plot shows that the residuals are not normally distributed

Normality (3)

# Test for normality with a Q-Q plot
qqnorm(residuals)
qqline(residuals)

TYPE HERE: explain why the Q-Q plot shows that the residuals are not normally distributed

Normality (4)

# Create random subsample of 5000 observations, so we are able to run a Shapiro-Wilk normality test (5000 is the maximum sample size)
set.seed(123)
my_subsample <- residuals_df[sample(nrow(residuals_df), 5000), ]
shapiro.test(my_subsample)

    Shapiro-Wilk normality test

data:  my_subsample
W = 0.48635, p-value < 2.2e-16

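A last check for normality is the Shapiro-Wilk test on a random subsample of 5000 residuals. With W = 0.48635 and a p-value below 2.2e-16, we reject the null hypothesis of normality: the residuals are not normally distributed.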
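The need for a subsample can be illustrated with a short sketch on made-up, right-skewed values (not the project residuals). shapiro.test() refuses samples larger than 5000, so the sketch catches that error and then runs the test on a random subsample.

```r
set.seed(123)
skewed <- rexp(6000)  # made-up, strongly right-skewed values

# Running the test on all 6000 values fails; the error message is captured
too_big <- tryCatch(shapiro.test(skewed),
                    error = function(e) conditionMessage(e))

# A random subsample of 5000 values is accepted
sub_test <- shapiro.test(sample(skewed, 5000))
sub_test$p.value  # very small: normality is rejected for these skewed values
```

With samples this large the test flags even modest departures from normality, so it is best read together with the histogram, density plot and Q-Q plot.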

Linearity (1)

We can use the same first plot as used for testing independence and homoskedasticity. We can conclude…

Multicollinearity (1)

# VIF test 
library(car)
vif(m1)
                                 GVIF Df GVIF^(1/(2*Df))
covid                        1.062619  1        1.030834
as.factor(neighbourhood_num) 1.263242 21        1.005579
as.factor(roomtype_num)      1.417333  3        1.059852
accommodates                 1.508334  1        1.228142
price                        1.623853  1        1.274305
instant_bookable             1.189423  1        1.090607

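One way to check for multicollinearity is to calculate variance inflation factors (VIFs). Because the model contains factors, vif() reports generalized VIFs, and the comparable column is GVIF^(1/(2*Df)). All values lie between about 1.01 and 1.27, so the VIFs give no indication of problematic multicollinearity.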
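For a predictor with one degree of freedom, the VIF is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all other predictors. The sketch below reproduces this by hand on simulated variables x1, x2 and x3, which are made up for illustration, and then applies the same formula in reverse to the value reported for price.

```r
# Made-up predictors: x2 is correlated with x1 (about 0.8), x3 is unrelated
set.seed(1)
n <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)

# VIF of x1: regress x1 on the other predictors and use the R-squared
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # clearly above 1 because x1 and x2 overlap

# Reading the reported value for price (1.623853) the other way around:
# the share of the variance in price explained by the other predictors
r2_price <- 1 - 1 / 1.623853
r2_price  # about 0.38
```

So even price, the predictor with the highest value in the output, shares well under half of its variance with the other predictors.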

Multicollinearity (2)

# correlation matrix
cor(df_cleaned[c("covid", "neighbourhood_num", "roomtype_num", "accommodates", "price", "instant_bookable")])
                         covid neighbourhood_num roomtype_num accommodates
covid              1.000000000      -0.006718571  -0.01926055   0.01126654
neighbourhood_num -0.006718571       1.000000000  -0.06783638   0.01764622
roomtype_num      -0.019260549      -0.067836379   1.00000000  -0.21649496
accommodates       0.011266540       0.017646217  -0.21649496   1.00000000
price             -0.164739514       0.007828189  -0.28513626   0.51640989
instant_bookable   0.112195055      -0.061621148   0.24110291  -0.06298997
                         price instant_bookable
covid             -0.164739514       0.11219506
neighbourhood_num  0.007828189      -0.06162115
roomtype_num      -0.285136264       0.24110291
accommodates       0.516409886      -0.06298997
price              1.000000000      -0.11301022
instant_bookable  -0.113010224       1.00000000

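Another way to check for multicollinearity is a correlation matrix of the predictors. The largest correlation is between accommodates and price (0.52); all other pairwise correlations are below 0.3 in absolute value. Note that neighbourhood_num and roomtype_num are category codes, so their correlations have no meaningful interpretation.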

Multicollinearity (3)

# eigenvalues and condition number 
eigen(cor(df_cleaned[c("covid", "neighbourhood_num", "roomtype_num", "accommodates", "price", "instant_bookable")]))$values
kappa(model.matrix(m1))
[1] 1.7869855 1.0974403 1.0378347 0.9427096 0.6897544 0.4452754
[1] 51687.31

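A last check for multicollinearity uses the eigenvalues of the predictor correlation matrix and the condition number. The eigenvalues range from 0.45 to 1.79 and none is close to zero, which again points to little collinearity among the predictors. The condition number of the design matrix (51687) is large, but it is computed on unscaled columns (for example price next to 0/1 dummies), so it probably reflects differences in scale rather than collinearity.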
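The two numbers in the output measure different things, which a short sketch can make visible. The first part uses the eigenvalues reported above to compute the condition number of the correlation matrix. The second part uses made-up, uncorrelated variables (size and price are hypothetical) to show that the condition number of an unscaled design matrix can be large purely because of differences in scale.

```r
# Eigenvalues of the predictor correlation matrix, copied from the output
ev <- c(1.7869855, 1.0974403, 1.0378347, 0.9427096, 0.6897544, 0.4452754)

# Condition number of the correlation matrix: sqrt(largest / smallest)
cond_cor <- sqrt(max(ev) / min(ev))
cond_cor  # about 2

# Made-up, uncorrelated predictors on very different scales
set.seed(42)
n <- 200
size  <- rnorm(n)
price <- rnorm(n, mean = 150, sd = 50)

X_raw    <- cbind(1, size, price)                # unscaled design matrix
X_scaled <- cbind(1, scale(size), scale(price))  # centered, unit variance

kappa_raw    <- kappa(X_raw, exact = TRUE)     # large, driven by scale alone
kappa_scaled <- kappa(X_scaled, exact = TRUE)  # close to 1
```

This is why a large kappa(model.matrix(m1)) on its own does not contradict the low VIFs. The call in the output uses the default exact = FALSE, which returns an estimate of the condition number, not the exact value.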

Conclusion

TYPE HERE: type the conclusion here